TITLE OF THE INVENTION



SPEECH INFORMATION PROCESSING METHOD AND APPARATUS, AND 



FIELD OF THE INVENTION 

The present invention relates to a technique for 
synthesizing speech by using a speech segment dictionary. 

BACKGROUND OF THE INVENTION

A speech synthesizing technique for synthesizing
speech by using a computer uses a speech segment dictionary.
This speech segment dictionary stores speech segments in
units (synthetic units) of phonemes, CV/VC, or VCV.

To synthesize speech, appropriate speech segments are
selected from this speech segment dictionary and modified
and connected to generate desired synthetic speech. A flow
chart in Fig. 15 explains this process.



First, kana-kanji mixed text and the like are input. In step S132, the input
speech contents are analyzed to obtain a speech segment
symbol string {p0, p1, ...} and parameters for determining
prosody. The flow then advances to step S133 to determine
the prosody such as the speech segment time length,
fundamental frequency, and power. In speech segment
dictionary look-up step S134, speech segments {w0, w1, ...}
appropriate for the speech segment symbol string {p0, p1, ...}
obtained by the input analysis in step S132 and the prosody
obtained by the prosody determination in step S133 are
retrieved from the speech segment dictionary. The flow
advances to step S135, and the speech segments {w0, w1, ...}
obtained by the speech segment dictionary retrieval in step
S134 are modified and concatenated to match the prosody
determined in step S133. In step S136, the result of the
speech segment modification and concatenation in step S135
is output as synthetic speech.

Waveform editing is one effective method of speech 
synthesis. This method, e.g., superposes waveforms and 
changes pitches in synchronism with vocal cord vibrations. 
The method is advantageous in that synthetic speech close 
to a natural utterance can be generated with a small amount 
of arithmetic operations. When a method like this is used, 
a speech segment dictionary is composed of indexes for 
retrieval, waveform data (also called speech segment data) 
corresponding to individual speech segments, and auxiliary 
information of the data. In this case, all speech segment 
data registered in the speech segment dictionary are often 
encoded using the μ-law scheme or ADPCM (Adaptive Differential
Pulse Code Modulation).

The above prior art has the following problems. 

First, when all speech segment data registered in the
speech segment dictionary are encoded by using an encoding
scheme such as the μ-law or A-law, sufficient compression
efficiency cannot be obtained since each speech segment data
is nonuniformly quantized using a fixed quantization table.
This is so because a quantization table must be so designed
that a minimum quality can be maintained for all types of
speech segments.
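
To illustrate what quantizing with a fixed table means here, the following is a minimal sketch of μ-law companding with a configurable number of bits. The companding constant MU = 255, the sample range of [-1, 1], and the function names are assumptions made for this sketch; the specification does not state them.

```python
import math

MU = 255.0  # assumed companding constant; the text does not specify one


def mulaw_encode(x, bits=8, mu=MU):
    """Quantize a sample x in [-1, 1] to an integer code in [0, 2**bits - 1]."""
    x = max(-1.0, min(1.0, x))  # clip to the assumed input range
    # Logarithmic compression: fine steps near zero, coarse steps near +/-1.
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    levels = (1 << bits) - 1
    return int((y + 1.0) / 2.0 * levels + 0.5)  # uniform rounding of y


def mulaw_decode(code, bits=8, mu=MU):
    """Invert mulaw_encode up to the quantization error."""
    levels = (1 << bits) - 1
    y = 2.0 * code / levels - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


if __name__ == "__main__":
    for x in (0.01, 0.5):
        for bits in (7, 8):
            err = abs(x - mulaw_decode(mulaw_encode(x, bits), bits))
            print(f"x={x} bits={bits} error={err:.6f}")
```

The mapping from sample to code is the same for every speech segment, which is the property the first problem refers to: the table cannot adapt to segments that would tolerate coarser steps.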

Second, when all speech segment data registered in the
speech segment dictionary are encoded using an encoding
scheme such as ADPCM, the operation amount in decoding
increases by the operation amount of an adaptive algorithm.
This is a problem because the advantage (small processing
amount) of the waveform editing method is impaired if a large
operation amount is required for decoding.

SUMMARY OF THE INVENTION

The present invention has been made in consideration
of the above prior art, and has as its object to provide a
technique which very efficiently reduces a storage capacity
necessary for a speech segment dictionary without degrading
the quality of speech segments registered in the speech
segment dictionary.

Also, the present invention has been made in
consideration of the above prior art, and has as another
object to provide a technique which generates natural,
high-quality synthetic speech.




To achieve the above objects, a speech information
processing method of the present invention is a speech
information processing method of generating a speech segment
dictionary for holding a plurality of speech segments,
characterized by comprising the selection step of selecting
an encoding method of encoding a speech segment from a
plurality of encoding methods, the encoding step of encoding
the speech segment by using the selected encoding method,
and the storage step of storing the encoded speech segment
in a speech segment dictionary.

A storage medium of the present invention is
characterized by storing a control program for allowing a
computer to realize the above speech information processing
method.

A speech information processing apparatus of the
present invention is a speech information processing
apparatus for generating a speech segment dictionary for
holding a plurality of speech segments, characterized by
comprising selecting means for selecting an encoding method
of encoding a speech segment from a plurality of encoding
methods, encoding means for encoding the speech segment by
using the selected encoding method, and storage means for
storing the encoded speech segment in a speech segment
dictionary.

A speech information processing method of the present
invention is a speech information processing method of
synthesizing speech by using a speech segment dictionary for
holding a plurality of speech segments, characterized by
comprising the selection step of selecting, from a plurality
of decoding methods, a decoding method of decoding a speech
segment read out from the speech segment dictionary, the
decoding step of decoding the speech segment by using the
selected decoding method, and the speech synthesizing step
of synthesizing speech on the basis of the decoded speech
segment.

A storage medium of the present invention is
characterized by storing a control program for allowing a
computer to realize the above speech information processing
method.

A speech information processing apparatus of the
present invention is a speech information processing
apparatus for synthesizing speech by using a speech segment
dictionary for holding a plurality of speech segments,
characterized by comprising selecting means for selecting,
from a plurality of decoding methods, a decoding method of
decoding a speech segment read out from the speech segment
dictionary, decoding means for decoding the speech segment
by using the selected decoding method, and speech
synthesizing means for synthesizing speech on the basis of
the decoded speech segment.

A speech information processing method of the present
invention is a speech information processing method of
generating a speech segment dictionary for holding a
plurality of speech segments, characterized by comprising
the setting step of setting an encoding method of encoding
a speech segment in accordance with the type of the speech
segment, the encoding step of encoding the speech segment
by using the set encoding method, and the storage step of
storing the encoded speech segment in a speech segment
dictionary.

A storage medium of the present invention is
characterized by comprising a control program for allowing
a computer to realize the above speech information processing
method.

A speech information processing apparatus of the
present invention is a speech information processing
apparatus for generating a speech segment dictionary for
holding a plurality of speech segments, characterized by
comprising setting means for setting an encoding method of
encoding a speech segment in accordance with the type of the
speech segment, encoding means for encoding the speech
segment by using the set encoding method, and storage means
for storing the encoded speech segment in a speech segment
dictionary.

A speech information processing method of the present
invention is a speech information processing method of
synthesizing speech by using a speech segment dictionary for
holding a plurality of speech segments, characterized by
comprising the setting step of setting a decoding method of
decoding a speech segment read out from the speech segment
dictionary in accordance with the type of the speech segment,
the decoding step of decoding the speech segment by using
the set decoding method, and the speech synthesizing step
of synthesizing speech on the basis of the decoded speech
segment.

A storage medium of the present invention is
characterized by comprising a control program for allowing
a computer to realize the above speech information processing
method.

A speech information processing apparatus of the
present invention is a speech information processing
apparatus for synthesizing speech by using a speech segment
dictionary for holding a plurality of speech segments,
characterized by comprising setting means for setting a
decoding method of decoding a speech segment read out from
the speech segment dictionary in accordance with the type
of the speech segment, decoding means for decoding the speech
segment by using the set decoding method, and speech
synthesizing means for synthesizing speech on the basis of
the decoded speech segment.

Other features and advantages of the present invention
will be apparent from the following description taken in
conjunction with the accompanying drawings, in which like
reference characters designate the same or similar parts
throughout the figures thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in
and constitute a part of the specification, illustrate
embodiments of the invention and, together with the
description, serve to explain the principles of the
invention.

Fig. 1 is a block diagram showing the hardware
configuration of a speech synthesizing apparatus according
to each embodiment of the present invention;

Fig. 2 is a flow chart for explaining a speech segment
dictionary formation algorithm in the first embodiment of
the present invention;

Fig. 3 is a flow chart for explaining a speech synthesis
algorithm in the first embodiment of the present invention;

Fig. 4 is a flow chart for explaining a speech segment
dictionary formation algorithm in the second embodiment of
the present invention;

Fig. 5 is a flow chart for explaining a speech synthesis
algorithm in the second embodiment of the present invention;

Fig. 6 is a flow chart for explaining a speech segment
dictionary formation algorithm in the third embodiment of
the present invention;

Fig. 7 is a flow chart for explaining the speech segment
dictionary formation algorithm in the third embodiment of
the present invention;

Fig. 8 is a flow chart for explaining a speech synthesis
algorithm in the third embodiment of the present invention;

Fig. 9 is a flow chart for explaining a speech segment
dictionary formation algorithm in the fourth embodiment of
the present invention;

Fig. 10 is a flow chart for explaining a speech
synthesis algorithm in the fourth embodiment of the present
invention;

Fig. 11 is a flow chart for explaining a speech segment
dictionary formation algorithm in the fifth embodiment of
the present invention;

Fig. 12 is a flow chart for explaining a speech
synthesis algorithm in the fifth embodiment of the present
invention;

Fig. 13 is a flow chart for explaining a speech segment
dictionary formation algorithm in the sixth embodiment of
the present invention;

Fig. 14 is a flow chart for explaining a speech
synthesis algorithm in the sixth embodiment of the present
invention; and

Fig. 15 is a flow chart showing a general speech
synthesizing process.




DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Preferred embodiments of the present invention will
be described in detail below with reference to the
accompanying drawings. In these embodiments, (1) a method
of forming a speech segment dictionary (a speech segment
dictionary formation algorithm) and (2) a method of
synthesizing speech by using this speech segment dictionary
(a speech synthesis algorithm) will be described in detail.

Fig. 1 is a block diagram showing an outline of the
functional configuration of a speech information processing
apparatus according to the embodiments of the present
invention. A speech segment dictionary formation algorithm
and a speech synthesis algorithm in each embodiment are
realized by using this speech information processing
apparatus.

Referring to Fig. 1, a central processing unit (CPU)
100 executes numerical operations and various control
processes and controls operations of individual units (to
be described later) connected via a bus 105. A storage device
101 includes, e.g., a RAM and ROM and stores various control
programs executed by the CPU 100, data, and the like. The
storage device 101 also temporarily stores various data
necessary for the control by the CPU 100. An external storage
device 102 is a hard disk device or the like and includes
a speech segment database 111 and a speech segment dictionary
112. This speech segment database 111 holds speech segments
before registration in the speech segment dictionary 112
(i.e., non-compressed speech segments). An output device
103 includes a monitor for displaying the operation statuses
of diverse programs, a loudspeaker for outputting
synthesized speech, and the like. An input device 104
includes, e.g., a keyboard and a mouse. By using this input
device 104, a user can control a program for forming the speech
segment dictionary 112, control a program for synthesizing
speech by using the speech segment dictionary 112, and input
text (containing a plurality of character strings) as an
object of speech synthesis.

On the basis of the above configuration, a speech 
segment dictionary formation algorithm and a speech 
synthesis algorithm in each embodiment will be described 
below. 

[First Embodiment] 

A speech segment dictionary formation algorithm and 
a speech synthesis algorithm according to the first 
embodiment of the present invention will be described below 
by using the speech processing apparatus shown in Fig. 1. 

In the first embodiment, one of a plurality of encoding
methods (more specifically, a 7-bit μ-law scheme and an 8-bit
μ-law scheme) different in the number of quantization steps
is selected for each speech segment to be registered in a
speech segment dictionary 112. Note that a speech segment
to be registered in the speech segment dictionary 112 is
composed of a phoneme, semi-phoneme, diphone (e.g., CV or
VC), VCV (or CVC), or combinations thereof.

(Formation of speech segment dictionary) 

Fig. 2 is a flow chart for explaining the speech segment
dictionary formation algorithm in the first embodiment of
the present invention. A program for achieving this
algorithm is stored in a storage device 101. A CPU 100 reads
out this program from the storage device 101 on the basis
of an instruction from a user and executes the following
procedure.

In step S201, the CPU 100 initializes an index i, which
indicates each of N speech segment data (each speech segment
data is non-compressed) stored in the speech segment database
111 of an external storage device 102, to "0". Note that this
index i is stored in the storage device 101.

In step S202, the CPU 100 reads out ith speech segment
data Wi indicated by this index i. Assume that the readout
data Wi is

Wi = {x0, x1, ..., xT-1}

where T is the time length (in units of samples) of Wi.

In step S203, the CPU 100 encodes the speech segment
data Wi read out in step S202 by using the 7-bit μ-law scheme.
Assume that the result of the encoding is

Ci = {c0, c1, ..., cT-1}

In step S204, the CPU 100 calculates encoding
distortion p produced by the 7-bit μ-law encoding in step
S203. In this embodiment, a mean square error p is used as
a measure of this encoding distortion. This mean square
error p can be represented by

$$p = \frac{1}{T} \sum_{t=0}^{T-1} \left( x_t - \mu_{(7)}^{-1}(c_t) \right)^2 \qquad (1)$$

where $\mu_{(7)}^{-1}(\cdot)$ is a 7-bit μ-law decoding function and
the summation runs from t = 0 to t = T - 1.
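
As a small worked illustration of equation (1), the sketch below computes the mean square error for any decoding function passed in. The toy codec that rounds to steps of 0.1 is a hypothetical stand-in chosen so the arithmetic is easy to follow; it is not the μ-law decoder of the embodiment.

```python
def mean_square_error(samples, codes, decode):
    """Equation (1): mean of (x_t - decode(c_t))**2 over t = 0 .. T-1."""
    if not samples or len(samples) != len(codes):
        raise ValueError("samples and codes must be non-empty and equally long")
    return sum((x - decode(c)) ** 2 for x, c in zip(samples, codes)) / len(samples)


# Hypothetical toy codec: a code is the sample rounded to steps of 0.1.
def toy_decode(c):
    return c / 10.0


samples = [0.12, -0.31, 0.50]
codes = [round(x * 10) for x in samples]  # [1, -3, 5]
p = mean_square_error(samples, codes, toy_decode)
print(p)  # roughly (0.02**2 + 0.01**2 + 0) / 3
```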

In step S205, the CPU 100 checks whether the encoding
distortion p calculated in step S204 is larger than a
predetermined threshold value p0. If p > p0, the CPU 100
determines that the waveform of the speech segment data Wi
is distorted by encoding using the 7-bit μ-law scheme.
Therefore, the flow advances to step S206, in which the CPU
100 switches the encoding method to the 8-bit μ-law scheme
having a different number of quantization bits. In other
cases, the flow advances to step S207. In step S206, the CPU
100 encodes the speech segment data Wi read out in step S202
by using the 8-bit μ-law scheme. Assume that the result of
the encoding is

Ci = {c0, c1, ..., cT-1}

In step S207, the CPU 100 writes encoding information
of the speech segment data Wi and the like in the speech
segment dictionary 112. In addition to the encoding
information, the CPU 100 writes information necessary to
decode the speech segment data Wi. This encoding information
specifies the encoding method by which the speech segment
data Wi is encoded:

The encoding information is "0" if the encoding method
is the 7-bit μ-law scheme

The encoding information is "1" if the encoding method
is the 8-bit μ-law scheme

In step S208, the CPU 100 writes the speech segment
data Wi encoded by one encoding scheme in the speech segment
dictionary 112. In step S209, the CPU 100 checks whether the
above processing is performed for all of the N speech segment
data. If i = N - 1, the CPU 100 completes this algorithm.
If not, in step S210 the CPU 100 adds 1 to the index i, the
flow returns to step S202, and the CPU 100 reads out speech
segment data designated by the updated index i. The CPU 100
repeatedly executes this processing for all of the N speech
segment data.
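
The loop of steps S201 to S210 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the companding constant MU = 255, the [-1, 1] sample range, the in-memory list used as the dictionary, and all function names are hypothetical, and each segment is assumed to be non-empty.

```python
import math

MU = 255.0  # assumed companding constant


def mulaw_encode(x, bits):
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int((y + 1.0) / 2.0 * ((1 << bits) - 1) + 0.5)


def mulaw_decode(code, bits):
    y = 2.0 * code / ((1 << bits) - 1) - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def distortion(segment, codes, bits):
    """Mean square error between a segment and its decoded codes."""
    return sum((x - mulaw_decode(c, bits)) ** 2
               for x, c in zip(segment, codes)) / len(segment)


def build_dictionary(segments, threshold):
    """Return a list of (encoding_information, codes) entries."""
    dictionary = []
    for segment in segments:                           # S201, S202, S209, S210
        info = 0
        codes = [mulaw_encode(x, 7) for x in segment]  # S203
        if distortion(segment, codes, 7) > threshold:  # S204, S205
            info = 1
            codes = [mulaw_encode(x, 8) for x in segment]  # S206
        dictionary.append((info, codes))               # S207, S208
    return dictionary
```

Storing the one-bit encoding information next to each entry is what lets the synthesis side pick the matching decoder without inspecting the codes themselves.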


In the speech segment dictionary formation algorithm
of the first embodiment as described above, an encoding
scheme can be selected from the 7-bit μ-law scheme and the
8-bit μ-law scheme for each speech segment to be registered
in the speech segment dictionary 112. With this arrangement,
a storage capacity necessary for the speech segment
dictionary can be very efficiently reduced without
deteriorating the quality of speech segments to be registered
in the speech segment dictionary. Also, a larger number of
types of speech segments than in conventional speech segment
dictionaries can be registered in a speech segment dictionary
having a storage capacity equivalent to those of the
conventional dictionaries.

In the first embodiment, the aforementioned speech
segment dictionary formation algorithm is realized on the
basis of the program stored in the storage device 101.
However, a part or the whole of this speech segment dictionary
formation algorithm can also be constituted by hardware.

(Speech synthesis) 

Fig. 3 is a flow chart for explaining the speech
synthesis algorithm in the first embodiment of the present
invention. A program for achieving this algorithm is stored
in the storage device 101. The CPU 100 reads out this program
on the basis of an instruction from a user and executes the
following procedure.

In step S301, the user inputs a character string in
Japanese, English, or some other language by using the
keyboard and the mouse of an input device 104. In the case
of Japanese, the user inputs a character string expressed
by kana-kanji mixed text. In step S302, the CPU 100 analyzes
the input character string and obtains the speech segment
sequence of this character string and parameters for
determining the prosody of this character string. In step
S303, on the basis of the prosodic parameters obtained in
step S302, the CPU 100 determines prosody such as a duration
length (the prosody for controlling the length of a voice),
fundamental frequency (the prosody for controlling the pitch
of a voice), and power (the prosody for controlling the
strength of a voice).




In step S304, the CPU 100 obtains an optimum speech 
segment sequence on the basis of the speech segment sequence 
obtained in step S302 and the prosody determined in step S303.
The CPU 100 selects one speech segment contained in this 
speech segment sequence and retrieves speech segment data 
corresponding to the selected speech segment and encoding 
information corresponding to this speech segment data. If 
the speech segment dictionary 112 is stored in a storage 
medium such as a hard disk, the CPU 100 sequentially seeks 
to storage areas of encoding information and speech segment 
data. If the speech segment dictionary 112 is stored in a 
storage medium such as a RAM, the CPU 100 sequentially moves 
a pointer (address register) to storage areas of encoding 
information and speech segment data. 

In step S305, the CPU 100 reads out the encoding 
information retrieved in step S304 from the speech segment 
dictionary 112. This encoding information indicates the 
encoding method of the speech segment data retrieved in step 
S304: 

If the encoding information is "0", the encoding method
is the 7-bit μ-law scheme

If the encoding information is "1", the encoding method
is the 8-bit μ-law scheme

In step S306, the CPU 100 examines the encoding
information read out in step S305. If the encoding
information is "0", the CPU 100 selects a decoding method
corresponding to the 7-bit μ-law scheme, and the flow
advances to step S307. If the encoding information is "1",
the CPU 100 selects a decoding method corresponding to the
8-bit μ-law scheme, and the flow advances to step S309.

In step S307, the CPU 100 reads out the speech segment
data (encoded by the 7-bit μ-law scheme) retrieved in step
S304 from the speech segment dictionary 112. In step S308,
the CPU 100 decodes the speech segment data encoded by the
7-bit μ-law scheme.

On the other hand, in step S309 the CPU 100 reads out
the speech segment data (encoded by the 8-bit μ-law scheme)
retrieved in step S304 from the speech segment dictionary
112. In step S310, the CPU 100 decodes the speech segment
data encoded by the 8-bit μ-law scheme.

In step S311, the CPU 100 checks whether speech segment 
data corresponding to all speech segments contained in the 
speech segment sequence obtained in step S304 are decoded. 
If all speech segment data are decoded, the flow advances 
to step S312. If speech segment data not decoded yet is 
present, the flow returns to step S304 to decode the next 
speech segment data. 

In step S312, on the basis of the prosody determined
in step S303, the CPU 100 modifies and concatenates the
decoded speech segments (i.e., edits the waveform). In step
S313, the CPU 100 outputs the synthetic speech obtained in
step S312 from the loudspeaker of an output device 103.
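
The decoder selection in steps S305 to S311 amounts to a dispatch on the stored encoding information. A minimal sketch follows, assuming entries of the form (encoding_information, codes), a companding constant MU = 255, and samples in [-1, 1]; these details and all names are hypothetical, and the prosody modification of step S312 is not shown.

```python
import math

MU = 255.0  # assumed companding constant


def mulaw_decode(code, bits):
    y = 2.0 * code / ((1 << bits) - 1) - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


# Encoding information -> number of bits of the matching mu-law decoder.
DECODER_BITS = {0: 7, 1: 8}


def decode_entry(info, codes):
    bits = DECODER_BITS[info]                      # S305, S306
    return [mulaw_decode(c, bits) for c in codes]  # S307/S308 or S309/S310


def decode_sequence(entries):
    """Decode every retrieved entry in order (the loop closed by S311)."""
    return [decode_entry(info, codes) for info, codes in entries]
```

An unknown encoding information value raises `KeyError` in this sketch, which makes a corrupted or newer-format entry fail loudly instead of being decoded with the wrong table.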

In the speech synthesis algorithm of the first
embodiment as described above, a desired speech segment can
be decoded by a decoding method corresponding to the 7-bit
μ-law scheme or the 8-bit μ-law scheme. With this
arrangement, natural, high-quality synthetic speech can be
generated.

In the first embodiment, the aforementioned speech
synthesis algorithm is realized on the basis of the program
stored in the storage device 101. However, a part or the
whole of this speech synthesis algorithm can also be
constituted by hardware.

[First Modification of the First Embodiment] 

In the first embodiment, speech segment data whose
encoding distortion is larger than a predetermined threshold
value is encoded by the 8-bit μ-law scheme. However, it is
also possible to obtain the encoding distortion after
encoding is performed by the 8-bit μ-law scheme, and register
speech segment data whose encoding distortion is larger than
a predetermined threshold value in a speech segment
dictionary without encoding the data. With this arrangement,
degradation of the quality of an unstable speech segment
(e.g., a speech segment classified into a voiced fricative
sound or a plosive) can be prevented. Also, natural,
high-quality synthetic speech can be generated by using a
speech segment dictionary thus formed.

[Second Modification of the First Embodiment] 





In the first embodiment, an encoding method is selected
from the 7-bit μ-law scheme and the 8-bit μ-law scheme in
accordance with the encoding distortion. However, it is also
possible, in accordance with the type (e.g., a voiced
fricative sound, plosive, nasal sound, some other voiced
sound, or unvoiced sound) of speech segment, to choose to
encode the speech segment by the 7-bit μ-law scheme or the
8-bit μ-law scheme or to register the speech segment in the
speech segment dictionary 112 without encoding it. For
example, a speech segment of the voiced fricative or plosive
type may be registered in the speech segment dictionary 112
without encoding it, a speech segment of the nasal or
unvoiced type may be registered in the speech segment
dictionary 112 by encoding with the 7-bit μ-law scheme, and
a speech segment of any other voiced type may be registered
in the speech segment dictionary 112 by encoding with the
8-bit μ-law scheme.
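
The example assignment in this modification can be written as a simple lookup table. The type labels and the function name below are hypothetical, chosen only for this sketch; `None` stands for "register without encoding", and 7 or 8 is the μ-law bit width.

```python
# Example assignment from the text; the string labels are hypothetical.
ENCODING_BY_TYPE = {
    "voiced_fricative": None,  # register without encoding
    "plosive": None,           # register without encoding
    "nasal": 7,                # 7-bit mu-law
    "unvoiced": 7,             # 7-bit mu-law
    "other_voiced": 8,         # 8-bit mu-law
}


def encoding_for(segment_type):
    """Return the mu-law bit width for a segment type, or None for raw storage."""
    return ENCODING_BY_TYPE[segment_type]
```

Because the choice depends only on the segment type, no distortion measurement is needed at dictionary formation time in this variant.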

[Second Embodiment]

A speech segment dictionary formation algorithm and
a speech synthesis algorithm according to the second
embodiment of the present invention will be described below
by using the speech processing apparatus shown in Fig. 1.

In the second embodiment, one of a plurality of
encoding methods using different quantization code books is
selected for each speech segment to be registered in a speech
segment dictionary 112. Note that a speech segment to be
registered in the speech segment dictionary 112 is composed
of a phoneme, semi-phoneme, diphone (e.g., CV or VC), VCV
(or CVC), or combinations thereof.

(Formation of speech segment dictionary) 

Fig. 4 is a flow chart for explaining the speech segment 
dictionary formation algorithm in the second embodiment of 
the present invention. A program for achieving this 
algorithm is stored in a storage device 101. A CPU 100 reads 
out this program from the storage device 101 on the basis 
of an instruction from a user and executes the following 
procedure. 

In step S401, the CPU 100 initializes an index i, which 
indicates each of N speech segment data (each speech segment 
data is non-compressed) stored in speech segment database 
111 of an external storage device 102, to "0". Note that this 
index i is stored in the storage device 101. 

In step S402, the CPU 100 reads out ith speech segment 
data Wi indicated by this index i. Assume that the readout 
data Wi is 

Wi = {x0, x1, ..., xT-1}

where T is the time length (in units of samples) of Wi.

In step S403, the CPU 100 forms a scalar quantization
code book Qi of the speech segment data Wi read out in step
S402. More specifically, the CPU 100 designs the scalar
quantization code book Qi so that a mean square error p
between the speech segment data Wi and a data sequence
Yi = {y0, y1, ..., yT-1}, obtained by encoding Wi and then
decoding it with Qi, is a minimum (i.e., the encoding
distortion is a minimum). In this case, an algorithm such as
an LBG method is usable. With this arrangement, the
distortion of the waveform of a speech segment produced by
encoding can be minimized. Note that the mean square error p
can be represented by

$$p = \frac{1}{T} \sum_{t=0}^{T-1} \left( x_t - y_t \right)^2 \qquad (2)$$

where the summation runs from t = 0 to t = T - 1.
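
One way to design such a per-segment code book is sketched below. It is a simplified Lloyd-style iteration for scalar data (the refinement loop that the LBG method also relies on) and omits the code book splitting used for initialization in full LBG; the evenly spaced starting levels, the fixed iteration count, and the function name are assumptions of this sketch.

```python
def design_codebook(samples, size, iterations=20):
    """Design `size` scalar quantization levels that reduce the error of eq. (2)."""
    lo, hi = min(samples), max(samples)
    # Assumed initialization: evenly spaced levels across the sample range.
    codebook = [lo + (hi - lo) * (n + 0.5) / size for n in range(size)]
    for _ in range(iterations):
        # Assign each sample to its nearest level.
        cells = [[] for _ in range(size)]
        for x in samples:
            n = min(range(size), key=lambda k: (x - codebook[k]) ** 2)
            cells[n].append(x)
        # Move each level to the mean of its cell; keep levels with empty cells.
        codebook = [sum(c) / len(c) if c else q for c, q in zip(cells, codebook)]
    return codebook


print(design_codebook([0.0, 0.1, 0.9, 1.0], 2))  # levels close to 0.05 and 0.95
```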

In step S404, the CPU 100 writes the scalar
quantization code book Qi formed in step S403 and the like
in the speech segment dictionary 112. In addition to the
quantization code book Qi, the CPU 100 writes information
necessary to decode the speech segment data Wi. In step S405,
the CPU 100 encodes (scalar-quantizes) the speech segment
data Wi by using the quantization code book Qi formed in step
S403.

Assuming the code book Qi is

Qi = {q0, q1, ..., qN-1} (N is the number of quantization steps),

a code ct corresponding to xt (an element of Wi) can be represented by

$$c_t = \arg\min_{n} \left( x_t - q_n \right)^2 \quad (0 \le n < N) \qquad (3)$$
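
Equation (3) is a nearest-level search over the code book. A minimal sketch, with a hypothetical function name and a made-up two-level code book:

```python
def quantize(samples, codebook):
    """Equation (3): for each x_t, the index n that minimizes (x_t - q_n)**2."""
    return [min(range(len(codebook)), key=lambda n: (x - codebook[n]) ** 2)
            for x in samples]


codes = quantize([0.0, 0.4, 0.9], [0.05, 0.95])
print(codes)  # [0, 0, 1]
```

When two levels are equally close, `min` returns the smaller index, so the result is deterministic.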


In step S406, the CPU 100 writes speech segment data
Ci (= {c0, c1, ..., cT-1}) encoded in step S405 into the speech
segment dictionary 112. In step S407, the CPU 100 checks
whether the above processing is performed for all of the N
speech segment data. If i = N - 1, the CPU 100 completes this
algorithm. If not, in step S408 the CPU 100 adds 1 to the
index i, the flow returns to step S402, and the CPU 100 reads
out speech segment data designated by the updated index i.
The CPU 100 repeatedly executes this processing for all of
the N speech segment data.

In the speech segment dictionary formation algorithm
of the second embodiment as described above, it is possible
to form a quantization code book for each speech segment to
be registered in the speech segment dictionary 112 and
scalar-quantize the speech segment by using the formed
quantization code book. With this arrangement, a storage
capacity necessary for the speech segment dictionary can be
very efficiently reduced without deteriorating the quality
of speech segments to be registered in the speech segment
dictionary. Also, a larger number of types of speech
segments than in conventional speech segment dictionaries
can be registered in a speech segment dictionary having a
storage capacity equivalent to those of the conventional
dictionaries.

In the second embodiment, the aforementioned speech
segment dictionary formation algorithm is realized on the
basis of the program stored in the storage device 101.
However, a part or the whole of this speech segment dictionary
formation algorithm can also be constituted by hardware.

(Speech synthesis)

Fig. 5 is a flow chart for explaining the speech
synthesis algorithm in the second embodiment of the present
invention. A program for achieving this algorithm is stored
in the storage device 101. The CPU 100 reads out this program
on the basis of an instruction from a user and executes the
following procedure.

In step S501, the user inputs a character string in 
Japanese, English, or some other language by using the 
keyboard and the mouse of an input device 104. In the case 
of Japanese, the user inputs a character string expressed 
by kana-kanji mixed text. In step S502, the CPU 100 analyzes 
the input character string and obtains the speech segment 
sequence of this character string and parameters for 
determining the prosody of this character string. In step 
S503, on the basis of the prosodic parameters obtained in 
step S502, the CPU 100 determines prosody such as a duration
length (the prosody for controlling the length of a voice),
fundamental frequency (the prosody for controlling the pitch
of a voice), and power (the prosody for controlling the
strength of a voice).

In step S504, the CPU 100 obtains an optimum speech
segment sequence on the basis of the speech segment sequence
obtained in step S502 and the prosody determined in step S503.
The CPU 100 selects one speech segment contained in this
speech segment sequence and retrieves a scalar quantization
code book and speech segment data corresponding to the
selected speech segment. If the speech segment dictionary
112 is stored in a storage medium such as a hard disk, the
CPU 100 sequentially seeks to storage areas of scalar
quantization code books and speech segment data. If the
speech segment dictionary 112 is stored in a storage medium
such as a RAM, the CPU 100 sequentially moves a pointer
(address register) to storage areas of scalar quantization
code books and speech segment data.

In step S505, the CPU 100 reads out the scalar 
quantization code book retrieved in step S504 from the speech 
segment dictionary 112. In step S506, the CPU 100 reads out 
the speech segment data retrieved in step S504 from the speech 
segment dictionary 112. In step S507, the CPU 100 decodes 
the speech segment data read out in step S506 by using the 
scalar quantization code book read out in step S505. 

In step S508, the CPU 100 checks whether speech segment 
data corresponding to all speech segments contained in the 
speech segment sequence obtained in step S504 are decoded. 
If all speech segment data are decoded, the flow advances 
to step S509. If speech segment data not decoded yet is 
present, the flow returns to step S504 to decode the next 
speech segment data. 
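
The decoding loop of steps S504 to S508 can be sketched as follows. This is only an illustration: the in-memory layout (a Python dict whose entries hold a "code_book" list and a "codes" list) and the function names are assumptions, not the storage format of the speech segment dictionary 112.

```python
# Hypothetical layout: each entry keeps its own scalar quantization
# code book and the encoded sample codes for one speech segment.
def decode_segment(entry):
    """Steps S505 to S507: map each code c_t to the code book value."""
    code_book = entry["code_book"]          # read out in step S505
    codes = entry["codes"]                  # read out in step S506
    return [code_book[c] for c in codes]    # decoded in step S507


def decode_sequence(dictionary, segment_ids):
    """Steps S504 and S508: repeat the decoding for every segment."""
    return [decode_segment(dictionary[seg_id]) for seg_id in segment_ids]


dictionary = {
    "ka": {"code_book": [-0.5, 0.0, 0.5, 1.0], "codes": [1, 2, 3, 2]},
    "a": {"code_book": [-1.0, -0.25, 0.25, 1.0], "codes": [0, 1, 2, 3]},
}
decoded = decode_sequence(dictionary, ["ka", "a"])
```

Because every entry carries its own code book, decoding is a plain table look-up per sample.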

In step S509, on the basis of the prosody determined 
in step S503, the CPU 100 modifies and connects the decoded 
speech segments (i.e., edits the waveform). In step S510, 
the CPU 100 outputs the synthetic speech obtained in step 
S509 from the loudspeaker of an output device 103.

In the speech synthesis algorithm of the second
embodiment as described above, a desired speech segment can 
be decoded using an optimum quantization code book for the 
speech segment. Accordingly, natural, high-quality 
synthetic speech can be generated. 

In the second embodiment, the aforementioned speech 
synthesis algorithm is realized on the basis of the program 
stored in the storage device 101. However, a part or the 
whole of this speech synthesis algorithm can also be 
constituted by hardware. 

[First Modification of the Second Embodiment] 

In the second embodiment, as in the first embodiment
described previously, the number of bits (i.e., the number
of quantization steps of scalar quantization) per sample can
be changed for each speech segment data. This can be
accomplished by changing the procedures of the second
embodiment as follows. That is, in the speech segment
dictionary formation algorithm, the number of quantization
steps is determined prior to the process (the write of the
scalar quantization code book) in step S404 of Fig. 4. The
determined number of quantization steps and the code book
are recorded in the speech segment dictionary 112. In the
speech synthesis algorithm, the number of quantization steps
is read out from the speech segment dictionary 112 before
the process (the read-out of the scalar quantization code
book) in step S505. As in the first embodiment, the number
of quantization steps can be determined on the basis of the
encoding distortion.
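
One possible way to determine the number of quantization steps from the encoding distortion is sketched below. The selection rule (take the smallest bit count whose distortion does not exceed a threshold), the Lloyd-style code book design used as a simple stand-in for the LBG method, and all function names are assumptions made for illustration.

```python
def design_code_book(samples, n_steps, iterations=20):
    """Lloyd-style scalar code book design (stand-in for LBG)."""
    lo, hi = min(samples), max(samples)
    # Start from evenly spaced levels across the sample range.
    book = [lo + (hi - lo) * (k + 0.5) / n_steps for k in range(n_steps)]
    for _ in range(iterations):
        cells = [[] for _ in book]
        for x in samples:
            n = min(range(n_steps), key=lambda k: (x - book[k]) ** 2)
            cells[n].append(x)
        # Move each level to the mean of its cell; keep empty cells as is.
        book = [sum(c) / len(c) if c else q for c, q in zip(cells, book)]
    return book


def distortion(samples, book):
    """Mean square error after quantizing with the given code book."""
    return sum(min((x - q) ** 2 for q in book) for x in samples) / len(samples)


def choose_steps(samples, threshold, max_bits=8):
    """Return (bits, code book) for the smallest bit count that fits."""
    for bits in range(1, max_bits + 1):
        book = design_code_book(samples, 2 ** bits)
        if distortion(samples, book) <= threshold:
            return bits, book
    return max_bits, book


bits, book = choose_steps([0.0, 1.0, 2.0, 3.0], threshold=0.1)
```

The chosen bit count would be written ahead of the code book so that the synthesis side knows how many levels to read.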

[Second Modification of the Second Embodiment] 

In the speech synthesis algorithm of the second 
embodiment, in step S505 a scalar quantization code book 
formed for each speech segment data is selected. However, 
the present invention is not limited to this embodiment. For 
example, from a plurality of types of scalar quantization 
code books previously held by the speech segment dictionary 
112, a code book having the highest performance (i.e., by 
which the quantization distortion is a minimum) can also be 
chosen. 
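
Choosing the best of several previously held code books can be sketched as below. The sketch assumes that the uncompressed samples are available when the choice is made; the function names and the example code books are hypothetical.

```python
def quantization_distortion(samples, code_book):
    """Mean square error when each sample snaps to its nearest level."""
    return sum(min((x - q) ** 2 for q in code_book) for x in samples) / len(samples)


def select_code_book(samples, code_books):
    """Index of the shared code book with minimum quantization distortion."""
    return min(
        range(len(code_books)),
        key=lambda k: quantization_distortion(samples, code_books[k]),
    )


shared = [[-1.0, 0.0, 1.0], [-0.2, 0.0, 0.2], [-4.0, 0.0, 4.0]]
best_for_quiet = select_code_book([0.1, -0.2, 0.15, 0.0], shared)
best_for_loud = select_code_book([3.5, -4.2, 0.3], shared)
```

Only the index of the selected code book then needs to be stored with the segment, since the code books themselves are shared.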

[Third Modification of the Second Embodiment] 

In the second embodiment, a quantization code book is 
so designed that the encoding distortion is a minimum, and 
speech segment data is scalar-quantized by using the designed 
quantization code book. However, speech segment data whose 
encoding distortion is larger than a predetermined threshold 
value can also be registered in a speech segment dictionary 
without being encoded. With this arrangement, degradation 
of the quality of an unstable speech segment (e.g., a speech 
segment classified into a voiced fricative sound or a 
plosive) can be prevented. Also, natural, high-quality 
synthetic speech can be generated by using a speech segment 
dictionary thus formed. 
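
The threshold test of this modification might look like the following sketch. The entry layout (an "encoded" flag plus "data") and the function names are assumptions; the threshold value is arbitrary.

```python
def encode(samples, code_book):
    """Nearest-level scalar quantization; returns one code per sample."""
    return [
        min(range(len(code_book)), key=lambda n: (x - code_book[n]) ** 2)
        for x in samples
    ]


def register_segment(samples, code_book, threshold):
    """Store codes if the distortion is acceptable, else the raw samples."""
    codes = encode(samples, code_book)
    decoded = [code_book[c] for c in codes]
    distortion = sum((x - y) ** 2 for x, y in zip(samples, decoded)) / len(samples)
    if distortion > threshold:
        # Too lossy (e.g., an unstable segment): keep it uncompressed.
        return {"encoded": False, "data": list(samples)}
    return {"encoded": True, "code_book": list(code_book), "data": codes}


book = [-1.0, 0.0, 1.0]
stable = register_segment([0.9, 0.1, -1.0], book, threshold=0.05)
unstable = register_segment([0.5, -0.5, 0.5], book, threshold=0.05)
```

The flag lets the synthesis side skip the code book look-up for entries that were stored without encoding.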
[Third Embodiment]

A speech segment dictionary formation algorithm and
a speech synthesis algorithm according to the third
embodiment of the present invention will be described below
by using the speech processing apparatus shown in Fig. 1.

In the above second embodiment, one of a plurality of 
encoding methods using different quantization code books is 
selected for each speech segment to be registered in a speech 
segment dictionary 112. In this third embodiment, however, 
one of a plurality of encoding methods using different 
quantization code books is selected for each of a plurality 
of speech segment clusters. Note that a speech segment to
be registered in the speech segment dictionary 112 is
composed of a phoneme, semi-phoneme, diphone (e.g., CV or
VC), VCV (or CVC), or combinations thereof.

(Formation of speech segment dictionary)

Fig. 6 is a flow chart for explaining the speech segment 
dictionary formation algorithm in the third embodiment of 
the present invention. A program for achieving this 
algorithm is stored in a storage device 101. A CPU 100 reads 
out this program from the storage device 101 on the basis 
of an instruction from a user and executes the following
procedure.

In step S601, the CPU 100 reads out all of N speech
segment data (each speech segment data is non-compressed)
stored in speech segment database 111 of an external storage
device 102. In step S602, the CPU 100 clusters all these
speech segments into a plurality of (M) speech segment
clusters. More specifically, the CPU 100 forms M speech
segment clusters in accordance with the similarity of the
waveform of each speech segment.

In step S603, the CPU 100 initializes index i which 
indicates each of the M speech segment clusters to "0". In 
step S604, the CPU 100 forms a scalar quantization code book 
Qi for ith speech segment cluster Li. In step S605, the CPU 
100 writes the code book Qi formed in step S604 into the speech 
segment dictionary 112. 

In step S606, the CPU 100 checks whether the above 
processing is performed for all of the M speech segment 
clusters. If i = M - 1 (the processing is completely
performed for all of the M speech segment clusters), the flow
advances to step S608. If not, in step S607 the CPU 100 adds
1 to the index i, the flow returns to step S604, and the CPU 
100 forms a scalar quantization code book for the next speech 
segment cluster. 

After scalar quantization code books are formed for
all of the M speech segment clusters, this algorithm advances
to step S608. In step S608, the CPU 100 initializes index
i, which indicates each of the N speech segments stored in
the speech segment database 111 of the external storage
device 102, to "0". In step S609, the CPU 100 selects a scalar
quantization code book Qi for ith speech segment data Wi.
This scalar quantization code book Qi selected is a
quantization code book corresponding to a speech segment
cluster to which the speech segment data Wi belongs.

In step S610, the CPU 100 writes information (code book
information) designating the scalar quantization code book
selected in step S609 and the like into the speech segment
dictionary 112. In addition to the code book information,
the CPU 100 writes information necessary to decode the speech
segment data Wi. In step S611, the CPU 100 encodes the speech
segment data Wi by using the code book Qi formed in step S604.
In step S612, the CPU 100 writes speech segment data Ci (=
{c0, c1, ..., cT-1}) encoded in step S611 into the speech
segment dictionary 112.

In step S613, the CPU 100 checks whether the above
processing is performed for all of the N speech segment data.
If i = N - 1, the CPU 100 completes this algorithm. If not,
in step S614 the CPU 100 adds 1 to the index i, the flow returns
to step S609, and the CPU 100 selects a scalar quantization
code book for the next speech segment data.
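
The two loops of Fig. 6 can be sketched as follows. The sketch assumes that the clustering of step S602 has already produced a cluster label for every segment and that no cluster is empty; the Lloyd-style code book design is a simple stand-in, and the dictionary layout ("code_books", "entries", "code_book_id") is hypothetical.

```python
def design_code_book(samples, n_steps, iterations=20):
    """Lloyd-style scalar code book design (stand-in for step S604)."""
    lo, hi = min(samples), max(samples)
    book = [lo + (hi - lo) * (k + 0.5) / n_steps for k in range(n_steps)]
    for _ in range(iterations):
        cells = [[] for _ in book]
        for x in samples:
            n = min(range(n_steps), key=lambda k: (x - book[k]) ** 2)
            cells[n].append(x)
        book = [sum(c) / len(c) if c else q for c, q in zip(cells, book)]
    return book


def encode(samples, book):
    """Nearest-level scalar quantization (step S611)."""
    return [min(range(len(book)), key=lambda n: (x - book[n]) ** 2) for x in samples]


def build_dictionary(segments, cluster_of, n_clusters, n_steps):
    # Steps S603 to S607: one code book per cluster, designed on the
    # pooled samples of all segments that belong to that cluster.
    code_books = []
    for m in range(n_clusters):
        pooled = [x for seg, c in zip(segments, cluster_of) if c == m for x in seg]
        code_books.append(design_code_book(pooled, n_steps))
    # Steps S608 to S614: per segment, store the code book information
    # (here the cluster index) and the encoded data.
    entries = [
        {"code_book_id": c, "codes": encode(seg, code_books[c])}
        for seg, c in zip(segments, cluster_of)
    ]
    return {"code_books": code_books, "entries": entries}


segments = [[0.0, 1.0], [1.0, 0.0, 0.0], [10.0, 20.0]]
result = build_dictionary(segments, cluster_of=[0, 0, 1], n_clusters=2, n_steps=2)
```

Three segments share two code books here, which is where the saving over one code book per segment comes from.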

In the speech segment dictionary formation algorithm 
of the third embodiment as described above, one of a plurality 
of encoding methods using different quantization code books 
can be selected for each of a plurality of speech segment 
clusters. This can reduce the number of quantization code 
books to be registered in the speech segment dictionary 112. 
With this arrangement, a storage capacity necessary for the
speech segment dictionary can be very efficiently reduced
without deteriorating the quality of speech segments to be
registered in the speech segment dictionary. Also, a larger
number of types of speech segments than in conventional
speech segment dictionaries can be registered in a speech
segment dictionary having a storage capacity equivalent to
those of the conventional dictionaries.

In the third embodiment, the aforementioned speech
segment dictionary formation algorithm is realized on the
basis of the program stored in the storage device 101.
However, a part or the whole of this speech segment dictionary
formation algorithm can also be constituted by hardware.

(Speech synthesis)

Fig. 8 is a flow chart for explaining the speech
synthesis algorithm in the third embodiment of the present
invention. A program for achieving this algorithm is stored
in the storage device 101. The CPU 100 reads out this program
on the basis of an instruction from a user and executes the
following procedure. For the sake of simplicity, in this
embodiment it is assumed that code books corresponding to
all speech segment clusters are previously stored in the
storage device 101.

Steps S801 to S803 have the same functions and processes
as in steps S501 to S503 of Fig. 5, so a detailed description
thereof will be omitted.

In step S804, the CPU 100 obtains an optimum speech
segment sequence on the basis of a speech segment sequence
obtained in step S802 and prosody determined in step S803.
The CPU 100 selects one speech segment contained in this
speech segment sequence and retrieves code book information
and speech segment data corresponding to the selected speech
segment. If the speech segment dictionary 112 is stored in
a storage medium such as a hard disk, the CPU 100 sequentially
seeks to storage areas of code book information and speech
segment data. If the speech segment dictionary 112 is stored
in a storage medium such as a RAM, the CPU 100 sequentially
moves a pointer (address register) to storage areas of code
book information and speech segment data.

In step S805, the CPU 100 reads out the code book
information retrieved in step S804 and determines a speech
segment cluster of this speech segment data and a scalar
quantization code book corresponding to the speech segment
cluster. In step S806, the CPU 100 looks up the speech
segment dictionary 112 to obtain the scalar quantization code
book determined in step S805. In step S807, the CPU 100 reads
out the speech segment data retrieved in step S804 from the
speech segment dictionary 112. In step S808, the CPU 100
decodes the speech segment data read out in step S807 by using
the scalar quantization code book obtained in step S806.

In step S809, the CPU 100 checks whether speech segment
data corresponding to all speech segments contained in the
speech segment sequence obtained in step S804 are decoded.
If all speech segment data are decoded, the flow advances
to step S810. If speech segment data not decoded yet is
present, the flow returns to step S804 to decode the next
speech segment data.
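
A minimal sketch of steps S804 to S808 follows. It assumes a hypothetical layout in which the code book information is simply the index of the cluster's code book; the names are illustrative only.

```python
def decode_entry(dictionary, index):
    """Decode one segment through the code book of its cluster."""
    entry = dictionary["entries"][index]                     # step S804
    book = dictionary["code_books"][entry["code_book_id"]]   # steps S805, S806
    return [book[c] for c in entry["codes"]]                 # steps S807, S808


dictionary = {
    "code_books": [[0.0, 1.0], [10.0, 20.0]],
    "entries": [
        {"code_book_id": 0, "codes": [1, 0, 0]},
        {"code_book_id": 1, "codes": [0, 1]},
    ],
}
first = decode_entry(dictionary, 0)
second = decode_entry(dictionary, 1)
```

Compared with the second embodiment, the only extra work at synthesis time is one indirection from the segment to its cluster's code book.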

In step S810, on the basis of the prosody determined
in step S803, the CPU 100 modifies and connects the decoded
speech segments (i.e., edits the waveform). In step S811,
the CPU 100 outputs the synthetic speech obtained in step
S810 from the loudspeaker of an output device 103.

In the speech synthesis algorithm of the third
embodiment as described above, a desired speech segment can
be decoded using an optimum quantization code book for a
speech segment cluster to which this speech segment belongs.
Accordingly, natural, high-quality synthetic speech can be
generated.

In the third embodiment, the aforementioned speech
synthesis algorithm is realized on the basis of the program
stored in the storage device 101. However, a part or the
whole of this speech synthesis algorithm can also be
constituted by hardware.

[First Modification of the Third Embodiment]

In the speech segment dictionary formation algorithm
of the third embodiment, the procedure of forming a speech
segment cluster in accordance with the similarity of the
waveform of a speech segment has been explained. However,
it is also possible to form a speech segment cluster in
accordance with the type (e.g., a voiced fricative sound,
plosive, nasal sound, some other voiced sound, or unvoiced
sound) of speech segment, and form a quantization code book
for each speech segment cluster.

[Second Modification of the Third Embodiment]

In the speech synthesis algorithm of the third
embodiment, in step S805 a scalar quantization code book
formed for each speech segment cluster is selected. However,
the present invention is not limited to this embodiment. For
example, from a plurality of types of scalar quantization
code books held by the speech segment dictionary 112, a code
book having the highest performance (i.e., by which the
quantization distortion is a minimum) can also be chosen.

[Third Modification of the Third Embodiment]

In the third embodiment, scalar quantization can also
be performed by taking the gain (power) into consideration.
That is, in step S609 a gain g of speech segment data is obtained
prior to selecting a scalar quantization code book. In step
S610, the obtained gain g and code book information are
written in the speech segment dictionary 112. In step S611,
quantization is performed by taking account of the gain g.
This means that equation (3) presented earlier is replaced
by

$c_t = \arg\min_n (x_t - g \cdot q_n)^2 \quad (0 \le n < N)$

Meanwhile, in step S808 (reference to a code book) of
the speech synthesis algorithm, the value q obtained by the
code book reference is multiplied by the gain g to yield a
decoded value.
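
Gain-aware quantization and the matching decoding step can be sketched as below. The text does not say how the gain g is computed; taking it as the peak magnitude of the segment is an assumption of this sketch, as are the function names.

```python
def segment_gain(samples):
    """Gain g, assumed here to be the peak magnitude (1.0 for silence)."""
    return max(abs(x) for x in samples) or 1.0


def encode_with_gain(samples, code_book, g):
    """c_t = argmin_n (x_t - g * q_n)^2 over the normalized code book."""
    return [
        min(range(len(code_book)), key=lambda n: (x - g * code_book[n]) ** 2)
        for x in samples
    ]


def decode_with_gain(codes, code_book, g):
    """Step S808: the looked-up value q is multiplied by the gain g."""
    return [g * code_book[c] for c in codes]


book = [-1.0, -0.5, 0.0, 0.5, 1.0]      # shared, normalized levels
samples = [4.0, -2.0, 0.0, 2.0]
g = segment_gain(samples)
codes = encode_with_gain(samples, book, g)
restored = decode_with_gain(codes, book, g)
```

Storing g per segment lets loud and quiet segments of the same cluster share one normalized code book.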

[Fourth Modification of the Third Embodiment]

In the third embodiment, an optimum quantization code
book is designed for each speech segment cluster, and speech
segment data belonging to each speech segment cluster is
scalar-quantized by using the designed quantization code
book. However, speech segment data found to increase the
encoding distortion can also be registered in a speech
segment dictionary without being encoded. With this
arrangement, degradation of the quality of an unstable speech
segment (e.g., a speech segment classified into a voiced
fricative sound or a plosive) can be prevented. Also,
natural, high-quality synthetic speech can be generated by
using a speech segment dictionary thus formed.

[Fourth Embodiment]

A speech segment dictionary formation algorithm and
a speech synthesis algorithm according to the fourth
embodiment of the present invention will be described below
by using the speech processing apparatus shown in Fig. 1.

In the fourth embodiment, a linear prediction
coefficient and a prediction difference are calculated for
each speech segment data, and the data is encoded by an optimum
quantization code book for the calculated prediction
difference. Note that a speech segment to be registered in
the speech segment dictionary 112 is composed of a phoneme,
semi-phoneme, diphone (e.g., CV or VC), VCV (or CVC), or
combinations thereof.

(Formation of speech segment dictionary)

Fig. 9 is a flow chart for explaining the speech segment
dictionary formation algorithm in the fourth embodiment of
the present invention. A program for achieving this
algorithm is stored in a storage device 101. A CPU 100 reads
out this program from the storage device 101 on the basis
of an instruction from a user and executes the following
procedure.

In step S901, the CPU 100 initializes an index i, which
indicates each of N speech segment data (each speech segment
data is non-compressed) stored in speech segment database
111 of an external storage device 102, to "0". In step S902,
the CPU 100 reads out speech segment data (a speech segment
before encoding) Wi of the ith speech segment indicated by
this index i. Assume that the readout data Wi is

$W_i = \{x_0, x_1, \ldots, x_{T-1}\}$

where T is the time length (in units of samples) of Wi.

In step S903, the CPU 100 calculates a linear
prediction coefficient and a prediction difference of the
speech segment data Wi read out in step S902. Assuming the
linear prediction order is L, this linear prediction
model is represented by using a linear prediction coefficient
al and a prediction difference dt as

$x_t = \sum_{l=1}^{L} a_l x_{t-l} + d_t$ ...(4)

where the summation is taken over l = 1 to L.

Hence, the linear prediction coefficient al which
minimizes the square-sum of the prediction difference dt

$\sum_{t=1}^{T-1} d_t^2$ ...(5)

is determined. In this expression, the summation is taken
over t = 1 to T - 1.

In step S904, the CPU 100 writes the linear prediction
coefficient al calculated in step S903 into the speech
segment dictionary 112. In step S905, the CPU 100 forms a
quantization code book Qi of the prediction difference dt
calculated in step S903. More specifically, the CPU 100
designs the quantization code book Qi so that a mean square
error p between the prediction difference dt and the decoded
data sequence Ei = {e0, e1, ..., eT-1}, which is obtained by
encoding the prediction difference dt and then decoding it
by using the quantization code book Qi, is a minimum (i.e.,
the encoding distortion is a minimum). In this case, an
algorithm such as an LBG method is usable. With this
arrangement, the distortion of the waveform of a speech
segment produced by encoding can be minimized. Note that the
mean square error p can be represented by

$p = \frac{1}{T} \sum_{t=0}^{T-1} (d_t - e_t)^2$ ...(6)

where the summation is taken over t = 0 to T - 1.

In step S906, the CPU 100 writes the quantization code
book Qi formed in step S905 and the like in the speech segment
dictionary 112. In addition to the code book Qi, the CPU 100
writes information necessary to decode the speech segment
data Wi. In step S907, the CPU 100 encodes the speech segment
data Wi by linear predictive coding by using the linear
prediction coefficient al calculated in step S903 and the
code book Qi formed in step S905. Assuming the code book Qi
is

$Q_i = \{q_0, q_1, \ldots, q_{N-1}\}$ (N is the number of quantization steps),

a code ct corresponding to xt (an element of Wi) can be
represented by

$c_t = \arg\min_n \left(x_t - \sum_{l=1}^{L} a_l y_{t-l} - q_n\right)^2 \quad (0 \le n < N)$ ...(7)

where yt is the value obtained by encoding and then decoding
xt by this method.

In step S908, the CPU 100 writes speech segment data
Ci (= {c0, c1, ..., cT-1}) encoded in step S907 into the speech
segment dictionary 112. In step S909, the CPU 100 checks
whether the above processing is performed for all of the N 
speech segment data. If i = N - 1, the CPU 100 completes this 
algorithm. If not, in step S910 the CPU 100 adds 1 to the 
index i, the flow returns to step S902, and the CPU 100 reads 
out speech segment data designated by the updated index i. 
The CPU 100 repeatedly executes this processing for all of 
the N speech segment data. 
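
Steps S903 and S907 can be illustrated with the sketch below. For brevity the coefficient estimation is restricted to the first-order case (L = 1), where minimizing the square-sum of equation (5) has a closed form; the encoder itself accepts any order. Treating samples before t = 0 as zero, the fixed example code book, and the function names are all assumptions.

```python
def estimate_coefficient(x):
    """First-order case of equations (4) and (5): the a_1 that
    minimizes the sum over t = 1 .. T-1 of (x_t - a_1 * x_{t-1})^2."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den if den else 0.0


def encode_lpc(x, coeffs, code_book):
    """Equation (7): quantize x_t minus the prediction formed from the
    already decoded samples y (not from the original samples)."""
    codes, y = [], []
    for t, xt in enumerate(x):
        # Decoded samples before t = 0 are taken as zero (assumption).
        pred = sum(a * y[t - l] for l, a in enumerate(coeffs, start=1) if t - l >= 0)
        n = min(range(len(code_book)), key=lambda k: (xt - pred - code_book[k]) ** 2)
        codes.append(n)
        y.append(pred + code_book[n])
    return codes, y


x = [1.0, 0.5, 0.25, 0.125]
a1 = estimate_coefficient(x)
codes, y = encode_lpc(x, [a1], code_book=[-1.0, 0.0, 1.0])
```

Predicting from the decoded values y keeps the encoder in step with what the decoder will be able to reconstruct, so quantization errors do not accumulate unseen.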

In the speech segment dictionary formation algorithm
of the fourth embodiment as described above, it is possible
to calculate a linear prediction coefficient and a prediction
difference for each speech segment to be registered in the
speech segment dictionary 112, and encode the speech segment
by an optimum quantization code book for the calculated
prediction difference. With this arrangement, a storage
capacity necessary for the speech segment dictionary can be
very efficiently reduced without deteriorating the quality
of speech segments to be registered in the speech segment
dictionary. Also, a larger number of types of speech
segments than in conventional speech segment dictionaries
can be registered in a speech segment dictionary having a
storage capacity equivalent to those of the conventional
dictionaries.

In the fourth embodiment, the aforementioned speech
segment dictionary formation algorithm is realized on the
basis of the program stored in the storage device 101.
However, a part or the whole of this speech segment dictionary
formation algorithm can also be constituted by hardware.

(Speech synthesis)

Fig. 10 is a flow chart for explaining the speech
synthesis algorithm in the fourth embodiment of the present
invention. A program for achieving this algorithm is stored
in the storage device 101. The CPU 100 reads out this program
on the basis of an instruction from a user and executes the
following procedure.

In step S1001, the user inputs a character string in
Japanese, English, or some other language by using the
keyboard and the mouse of an input device 104. In the case
of Japanese, the user inputs a character string expressed
by kana-kanji mixed text. In step S1002, the CPU 100 analyzes
the input character string and obtains the speech segment
sequence of this character string and parameters for
determining the prosody of this character string. In step
S1003, on the basis of the prosodic parameters obtained in
step S1002, the CPU 100 determines prosody such as a duration
length (the prosody for controlling the length of a voice),
the fundamental frequency (the prosody for controlling the
pitch of a voice), and the power (the prosody for controlling
the strength of a voice).

In step S1004, the CPU 100 obtains an optimum speech 
segment sequence on the basis of the speech segment sequence 
obtained in step S1002 and the prosody determined in step 
S1003. The CPU 100 selects one speech segment contained in 
this speech segment sequence and retrieves a linear 
prediction coefficient, quantization code book, and 
prediction difference corresponding to the selected speech 
segment. If the speech segment dictionary 112 is stored in 
a storage medium such as a hard disk, the CPU 100 sequentially 
seeks to storage areas of linear prediction coefficients, 
quantization code books, and prediction differences. If the 
speech segment dictionary 112 is stored in a storage medium 
such as a RAM, the CPU 100 sequentially moves a pointer 
(address register) to storage areas of linear prediction 
coefficients, quantization code books, and prediction 
differences.

In step S1005, the CPU 100 reads out the prediction
coefficient retrieved in step S1004 from the speech segment
dictionary 112. In step S1006, the CPU 100 reads out the
quantization code book retrieved in step S1004 from the
speech segment dictionary 112. In step S1007, the CPU 100
reads out the prediction difference retrieved in step S1004
from the speech segment dictionary 112. In step S1008, the
CPU 100 decodes the prediction difference by using the
prediction coefficient, the quantization code book, and the
decoded data of the immediately preceding sample, thereby
obtaining speech segment data.

In step S1009, the CPU 100 checks whether speech
segment data corresponding to all speech segments contained
in the speech segment sequence obtained in step S1004 are
decoded. If all speech segment data are decoded, the flow
advances to step S1010. If speech segment data not decoded
yet is present, the flow returns to step S1004 to decode the
next speech segment data.
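
The decoding of step S1008 can be sketched as follows. As a simplifying assumption, decoded samples before the start of the segment are treated as zero; the function name and the example values are hypothetical.

```python
def decode_lpc(codes, coeffs, code_book):
    """Rebuild y_t = sum_l a_l * y_{t-l} + q[c_t] sample by sample."""
    y = []
    for t, c in enumerate(codes):
        # Prediction from previously decoded samples (zero before t = 0).
        pred = sum(a * y[t - l] for l, a in enumerate(coeffs, start=1) if t - l >= 0)
        y.append(pred + code_book[c])
    return y


restored = decode_lpc([2, 1, 1, 1], coeffs=[0.5], code_book=[-1.0, 0.0, 1.0])
```

Each output sample depends on the ones decoded before it, which is why the samples of a segment have to be decoded in order.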

In step S1010, on the basis of the prosody determined
in step S1003, the CPU 100 modifies and connects the decoded
speech segments (i.e., edits the waveform). In step S1011,
the CPU 100 outputs the synthetic speech obtained in step
S1010 from the loudspeaker of an output device 103.

In the speech synthesis algorithm of the fourth
embodiment as described above, a desired speech segment can
be decoded using an optimum quantization code book for the
speech segment. Accordingly, natural, high-quality
synthetic speech can be generated.

In the fourth embodiment, the aforementioned speech
synthesis algorithm is realized on the basis of the program
stored in the storage device 101. However, a part or the
whole of this speech synthesis algorithm can also be
constituted by hardware.

[First Modification of the Fourth Embodiment]

In the fourth embodiment, as in the first embodiment
described earlier, the number of bits (i.e., the number of
quantization steps) per sample can be changed for each speech
segment data. This can be accomplished by changing the
procedures of the fourth embodiment as follows. That is, in
the speech segment dictionary formation algorithm, the
number of quantization steps is determined prior to the
process (the write of the quantization code book) in step
S905. The determined number of quantization steps and the
code book are recorded in the speech segment dictionary 112.
In the speech synthesis algorithm, the number of quantization
steps is read out from the speech segment dictionary 112
before the process (the read-out of the quantization code
book) in step S1006. As in the first embodiment, the number
of quantization steps can be determined on the basis of the
encoding distortion.

[Second Modification of the Fourth Embodiment]

In the fourth embodiment, the linear prediction order
L can also be changed for each speech segment data. This can
be accomplished by changing the procedures of the fourth
embodiment as follows. That is, in the speech segment
dictionary formation algorithm, the prediction order is set
prior to the process (the write of the prediction
coefficient) in step S904. The set prediction order and the
prediction coefficient are recorded in the speech segment
dictionary 112. In the speech synthesis algorithm, the
prediction order is read out from the speech segment
dictionary 112 before the process (the read-out of the
prediction coefficient) in step S1005. As in the first
embodiment, this prediction order can be determined on the
basis of the encoding distortion.

[Third Modification of the Fourth Embodiment]

In the fourth embodiment, the encoding performance of
the quantization code book formed in step S905 can be further
improved. This is so because while in step S905 the code book
is optimized for the prediction difference dt, in step S907
the quantization code book is referred to with respect to

$x_t - \sum_{l=1}^{L} a_l y_{t-l} \quad \left(\neq d_t = x_t - \sum_{l=1}^{L} a_l x_{t-l}\right)$ ...(8)

In this expression, the summation is taken over l = 1 to L.
An AbS (Analysis by Synthesis) method or the like can be used
as an algorithm for updating this code book.

[Fourth Modification of the Fourth Embodiment]




In the fourth embodiment, one quantization code book
is designed for one speech segment data. However, one
quantization code book can also be designed for a plurality
of speech segment data. For example, as in the third
embodiment, it is possible to cluster N speech segment data
into M speech segment clusters and design a quantization code
book for each speech segment cluster.

[Fifth Modification of the Fourth Embodiment]

In the fourth embodiment, data of L samples from the
beginning of speech segment data can be directly written in
the speech segment dictionary 112 without being encoded.
This makes it possible to avoid a phenomenon in which linear
prediction cannot be well performed for L samples from the
beginning of speech segment data.

[Sixth Modification of the Fourth Embodiment]

In the fourth embodiment, in step S907 the code ct that
is optimum for xt is obtained. However, this optimum code
ct can also be obtained by taking account of m samples after
xt. This can be realized by temporarily determining the code
ct and recursively searching for the code ct (searching the
tree structure).

[Seventh Modification of the Fourth Embodiment] 

In the fourth embodiment, a quantization code book is
so designed that the encoding distortion is a minimum, and
speech segment data is linearly encoded by using the designed
quantization code book. However, speech segment data whose
encoding distortion is larger than a predetermined threshold
value can be registered in a speech segment dictionary
without being encoded. With this arrangement, degradation
of the quality of an unstable speech segment (e.g., a speech
segment classified into a voiced fricative sound or a
plosive) can be prevented. Also, natural, high-quality
synthetic speech can be generated by using a speech segment
dictionary thus formed.

[Fifth Embodiment]

A speech segment dictionary formation algorithm and
a speech synthesis algorithm according to the fifth
embodiment of the present invention will be described below
by using the speech processing apparatus shown in Fig. 1.

In the fifth embodiment, the various encoding schemes
used in the previous embodiments are combined, and an optimum
encoding method is selected for each speech segment data to
be registered in a speech segment dictionary 112. In this
fifth embodiment, an unstable speech segment (e.g., a speech
segment classified into a voiced fricative sound or a
plosive) is processed without being compressed. Note that
a speech segment to be registered in the speech segment
dictionary 112 is composed of a phoneme, semi-phoneme,
diphone (e.g., CV or VC), VCV (or CVC), or combinations
thereof.

(Formation of speech segment dictionary)

Fig. 11 is a flow chart for explaining the speech
segment dictionary formation algorithm in the fifth
embodiment of the present invention. A program for achieving
this algorithm is stored in a storage device 101. A CPU 100
reads out this program from the storage device 101 on the
basis of an instruction from a user and executes the following
procedure.

In step S1101, the CPU 100 initializes an index i, which 
indicates each of N speech segment data (each speech segment 
10 data is non-compressed) stored in speech segment database 
111 of an external storage device 102, to "0". Note that this 
index i is stored in the storage device 101. 

In step S1102, the CPU 100 reads out ith speech segment 
data Wi indicated by this index i. Assume that the readout 
15 data Wi is 

Wi = {x0, xl, . . . , xT-1} 
where T is the time length (in units of samples) of Wi . 

In step S1103, the CPU 100 encodes the speech segment
data Wi read out in step S1102 by using the encoding scheme
(i.e., linear predictive coding) explained in the fourth
embodiment.
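
As a non-limiting illustration, the prediction step underlying linear predictive coding may be sketched as follows. The function names lpc_residual and lpc_reconstruct are introduced here for explanation only, the prediction coefficients are assumed to be given in advance, and quantization of the prediction difference is omitted.

```python
def lpc_residual(samples, coeffs):
    """Prediction difference d_t = x_t - sum_k a_k * x_(t-k).

    coeffs[0] is a_1, coeffs[1] is a_2, and so on (assumed given).
    Samples before the start of the segment are treated as absent.
    """
    residual = []
    for t, x in enumerate(samples):
        pred = sum(a * samples[t - k]
                   for k, a in enumerate(coeffs, start=1) if t - k >= 0)
        residual.append(x - pred)
    return residual


def lpc_reconstruct(residual, coeffs):
    """Inverse of lpc_residual: rebuild samples from the prediction difference."""
    out = []
    for t, d in enumerate(residual):
        pred = sum(a * out[t - k]
                   for k, a in enumerate(coeffs, start=1) if t - k >= 0)
        out.append(d + pred)
    return out
```

For a well-predicted waveform the prediction difference has a much smaller range than the samples themselves, which is why it can be quantized with fewer bits.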

In step S1104, the CPU 100 calculates encoding
distortion p by this encoding scheme. In step S1105, the
CPU 100 checks whether the encoding distortion p calculated
in step S1104 is larger than a predetermined threshold value
p0. If p > p0, the flow advances to step S1108, and the
CPU 100 encodes the speech segment data Wi by using another
encoding scheme. If p > p0 does not hold, the flow advances
to step S1106.
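
The calculation of the encoding distortion p is not detailed in this section. As one possible sketch, assuming that p is the mean squared error between the original samples and the samples obtained by decoding the encoded data, the check of step S1105 may be written as follows. The function names are hypothetical.

```python
def encoding_distortion(original, decoded):
    """Mean squared error between original and decoded samples.

    This measure is an assumption made for illustration; any other
    distortion measure could be substituted.
    """
    if len(original) != len(decoded):
        raise ValueError("original and decoded must have the same length")
    if not original:
        return 0.0
    return sum((x - y) ** 2 for x, y in zip(original, decoded)) / len(original)


def needs_another_scheme(p, threshold):
    """True when p > threshold, i.e. another encoding scheme should be tried."""
    return p > threshold
```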

In step S1106, the CPU 100 writes encoding information
of the speech segment data Wi in the speech segment dictionary
112. This encoding information contains information
specifying the encoding method by which the speech segment
data Wi is encoded and information necessary to decode the
speech segment data Wi (e.g., a prediction coefficient and
a quantization code book). In step S1107, the CPU 100 writes
the speech segment data Wi encoded in step S1103 into the
speech segment dictionary 112, and the flow advances to step
S1120.

On the other hand, in step S1108 the CPU 100 encodes
the speech segment data Wi read out in step S1102 by using
the encoding scheme (i.e., the 7-bit μ-law scheme or the
8-bit μ-law scheme) explained in the first embodiment.
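
For reference, a minimal sketch of μ-law companding is given below. It uses the textbook continuous formula with μ = 255 and a 16-bit peak value, both of which are assumptions; it is not necessarily the table-based procedure of the first embodiment. The parameter bits selects the 7-bit or the 8-bit variant.

```python
import math

MU = 255  # assumed companding constant


def mulaw_encode(sample, bits=8, peak=32768.0):
    """Compress one sample into an unsigned code of the given bit width."""
    x = max(-1.0, min(1.0, sample / peak))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = (1 << bits) - 1
    return int(round((y + 1.0) / 2.0 * levels))


def mulaw_decode(code, bits=8, peak=32768.0):
    """Expand a code produced by mulaw_encode back to a sample value."""
    levels = (1 << bits) - 1
    y = code / levels * 2.0 - 1.0
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return x * peak
```

Because the quantization steps are finer near zero, small-amplitude samples are reproduced with a smaller absolute error than large-amplitude samples.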

In step S1109, the CPU 100 calculates encoding
distortion p by this encoding scheme. In step S1110, the
CPU 100 checks whether the encoding distortion p calculated
in step S1109 is larger than a predetermined threshold value
p1. If p > p1, the flow advances to step S1113, and the
CPU 100 encodes the speech segment data Wi by using another
encoding scheme. If p > p1 does not hold, the flow advances
to step S1111.





In step S1111, the CPU 100 writes encoding information
of the speech segment data Wi in the speech segment dictionary
112. This encoding information contains information
specifying the encoding method by which the speech segment
data Wi is encoded and information necessary to decode the
speech segment data Wi. In step S1112, the CPU 100 writes
the speech segment data Wi encoded in step S1108 into the
speech segment dictionary 112, and the flow advances to step
S1120.

On the other hand, in step S1113 the CPU 100 encodes
the speech segment data Wi read out in step S1102 by using
the encoding scheme (i.e., scalar quantization) explained
in the second or third embodiment.

In step S1114, the CPU 100 calculates encoding
distortion p by this encoding scheme. In step S1115, the
CPU 100 checks whether the encoding distortion p calculated
in step S1114 is larger than a predetermined threshold value
p2. For example, the waveform of a strongly unstable speech
segment (e.g., a speech segment classified into a voiced
fricative sound or a plosive) varies greatly, so p > p2
holds for such a segment. If p > p2, the flow advances to
step S1118. If p > p2 does not hold, the flow advances to
step S1116.

In step S1116, the CPU 100 writes encoding information
of the speech segment data Wi in the speech segment dictionary
112. This encoding information contains information
specifying the encoding method by which the speech segment
data Wi is encoded and information necessary to decode the
speech segment data Wi (e.g., a quantization code book). In
step S1117, the CPU 100 writes the speech segment data Wi
encoded in step S1113 into the speech segment dictionary 112,
and the flow advances to step S1120.
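
The role of the quantization code book may be illustrated by the following sketch, in which each sample is replaced by the index of the nearest code book entry. The function names are hypothetical, and the way the code book is designed in the second and third embodiments is not reproduced here; the code book is simply assumed to be given.

```python
def quantize(samples, code_book):
    """Map each sample to the index of the nearest code book value."""
    return [min(range(len(code_book)), key=lambda k: abs(code_book[k] - x))
            for x in samples]


def dequantize(codes, code_book):
    """Replace each index by its code book value."""
    return [code_book[c] for c in codes]
```

Only the indices are stored as speech segment data, which is why the code book itself must be kept in the encoding information for decoding.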

On the other hand, in step S1118 the CPU 100 writes
encoding information of the speech segment data Wi read out
in step S1102 into the speech segment dictionary 112 without
compressing the speech segment data Wi. This encoding
information contains information indicating that the speech
segment data Wi is not encoded. In step S1119, the CPU 100
writes this speech segment data Wi in the speech segment
dictionary 112, and the flow advances to step S1120. With
this arrangement, deterioration of the quality of an unstable
speech segment can be prevented.

In step S1120, the CPU 100 checks whether the above
processing is performed for all of the N speech segment data.
If i = N - 1, the CPU 100 completes this algorithm. If not,
in step S1121 the CPU 100 adds 1 to the index i, the flow
returns to step S1102, and the CPU 100 reads out speech segment
data designated by the updated index i. The CPU 100
repeatedly executes this processing for all of the N speech
segment data.
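
The overall flow of Fig. 11 (steps S1101 to S1121) can be summarized by the following sketch. It is an illustration under stated assumptions only: the distortion is taken to be the mean squared error, the names select_encoding, build_dictionary, and TOY_SCHEMES are hypothetical, and the two toy coders merely stand in for linear predictive coding, the μ-law scheme, and scalar quantization.

```python
def mse(original, decoded):
    """Mean squared error; an assumed distortion measure."""
    return sum((x - y) ** 2 for x, y in zip(original, decoded)) / max(len(original), 1)


def select_encoding(segment, schemes):
    """Try each (name, encode, decode, threshold) in order; fall back to raw.

    Returns (encoding_info, stored_data), mirroring the two writes to the
    dictionary: encoding information first, then speech segment data.
    """
    for name, encode, decode, threshold in schemes:
        encoded = encode(segment)
        p = mse(segment, decode(encoded))
        if not p > threshold:  # "p > threshold does not hold": accept scheme
            return {"encoded": True, "method": name}, encoded
    # Every scheme exceeded its threshold: store without compression.
    return {"encoded": False, "method": None}, list(segment)


def build_dictionary(database, schemes):
    """Process all N segments of the database in index order."""
    return [select_encoding(segment, schemes) for segment in database]


# Toy stand-ins for the real coders, tried in the listed order.
TOY_SCHEMES = [
    ("coarse", lambda s: [x // 16 for x in s], lambda c: [v * 16 for v in c], 0.5),
    ("fine", lambda s: [x // 2 for x in s], lambda c: [v * 2 for v in c], 0.5),
]
```

With these toy coders, build_dictionary([[16, 32, 48], [1, 3, 5]], TOY_SCHEMES) stores the first segment with the "coarse" coder and the second segment without compression, since both toy coders distort it beyond the threshold.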

In the speech segment dictionary formation algorithm
of the fifth embodiment as described above, an encoding
scheme can be selected from the μ-law scheme, scalar
quantization, and linear predictive coding for each speech
segment to be registered in the speech segment dictionary
112. With this arrangement, a storage capacity necessary for
the speech segment dictionary can be very efficiently reduced
without deteriorating the quality of speech segments to be
registered in the speech segment dictionary. Also, a larger
number of types of speech segments than in conventional
speech segment dictionaries can be registered in a speech
segment dictionary having a storage capacity equivalent to
those of the conventional dictionaries.

In the fifth embodiment, the aforementioned speech 
segment dictionary formation algorithm is realized on the 
basis of the program stored in the storage device 101. 
However, a part or the whole of this speech segment dictionary
formation algorithm can also be constituted by hardware.

(Speech synthesis)

Fig. 12 is a flow chart for explaining the speech 
synthesis algorithm in the fifth embodiment of the present 
invention. A program for achieving this algorithm is stored 
in the storage device 101. The CPU 100 reads out this program 
on the basis of an instruction from a user and executes the 
following procedure. 

In step S1201, the user inputs a character string in
Japanese, English, or some other language by using the
keyboard and the mouse of an input device 104. In the case
of Japanese, the user inputs a character string expressed
by kana-kanji mixed text. In step S1202, the CPU 100 analyzes
the input character string and obtains the speech segment
sequence of this character string and parameters for
determining the prosody of this character string. In step
S1203, on the basis of the prosodic parameters obtained in
step S1202, the CPU 100 determines prosody such as a duration
length (the prosody for controlling the length of a voice),
fundamental frequency (the prosody for controlling the pitch
of a voice), and power (the prosody for controlling the
strength of a voice).

In step S1204, the CPU 100 obtains an optimum speech 
segment sequence on the basis of the speech segment sequence 
obtained in step S1202 and the prosody determined in step 
S1203. The CPU 100 selects one speech segment contained in 
this speech segment sequence and retrieves speech segment 
data and encoding information corresponding to the selected 
speech segment. If the speech segment dictionary 112 is 
stored in a storage medium such as a hard disk, the CPU 100 
sequentially seeks to storage areas of speech segment data 
and encoding information. If the speech segment dictionary 
112 is stored in a storage medium such as a RAM, the CPU 100 
sequentially moves a pointer (address register) to storage 
areas of speech segment data and encoding information. 
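
The retrieval by seeking to a storage area can be pictured with the following sketch. The record layout (a one-byte method identifier, a four-byte payload length, then the payload) and the function names are assumptions made purely for illustration; the actual layout of the speech segment dictionary 112 is not specified here.

```python
import io
import struct

HEADER = "<BI"  # assumed layout: method id (1 byte), payload length (4 bytes)


def write_dictionary(segments):
    """Serialize (method_id, payload) records; return (blob, offsets)."""
    buf = io.BytesIO()
    offsets = []
    for method_id, payload in segments:
        offsets.append(buf.tell())
        buf.write(struct.pack(HEADER, method_id, len(payload)))
        buf.write(payload)
    return buf.getvalue(), offsets


def read_segment(stream, offset):
    """Seek to a record, read its encoding information, then its data."""
    stream.seek(offset)
    method_id, length = struct.unpack(HEADER, stream.read(struct.calcsize(HEADER)))
    return method_id, stream.read(length)
```

The same offsets serve as pointer positions when the dictionary is held in memory instead of on a disk.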

In step S1205, the CPU 100 reads out the encoding
information retrieved in step S1204 from the speech segment
dictionary 112. In step S1206, the CPU 100 reads out the
speech segment data retrieved in step S1204 from the speech
segment dictionary 112.

In step S1207, on the basis of the encoding information
read out in step S1205, the CPU 100 checks whether the speech
segment data read out in step S1206 is encoded. If the data
is encoded, the flow advances to step S1208 to specify the
encoding method. If the data is not encoded, the flow
advances to step S1215.

In step S1208, on the basis of the encoding information
read out in step S1205, the CPU 100 examines the encoding
method of the speech segment data read out in step S1206.
If the encoding method is linear predictive coding, the flow
advances to step S1212 to decode the data. In other cases,
the flow advances to step S1209.

In step S1209, on the basis of the encoding information
read out in step S1205, the CPU 100 examines the encoding
method of the speech segment data read out in step S1206.
If the encoding method is the μ-law scheme, the flow advances
to step S1213 to decode the data. In other cases, the flow
advances to step S1210.

In step S1210, on the basis of the encoding information
read out in step S1205, the CPU 100 examines the encoding
method of the speech segment data read out in step S1206.
If the encoding method is scalar quantization, the flow
advances to step S1214 to decode the data. In other cases,
the flow advances to step S1211.





In step S1211, the CPU 100 checks whether speech
segment data corresponding to all speech segments contained
in the speech segment sequence obtained in step S1204 are
decoded. If all speech segment data are decoded, the flow
advances to step S1215. If speech segment data not decoded
yet is present, the flow returns to step S1204 to decode the
next speech segment data.
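
The selection of a decoding method in steps S1207 to S1214 can be sketched as below. Instead of the chain of successive checks in the flow chart, this illustration looks the decoder up in a table keyed by the method recorded in the encoding information; the dictionary keys and function names are hypothetical.

```python
def decode_segment(encoding_info, data, decoders):
    """Return decoded samples for one dictionary entry.

    encoding_info: dict with "encoded" (bool) and "method" (str or None).
    decoders: mapping from method name to a function(data, encoding_info).
    """
    if not encoding_info.get("encoded"):
        return list(data)  # stored without compression
    method = encoding_info["method"]
    if method not in decoders:
        raise ValueError("unknown encoding method: %r" % (method,))
    return decoders[method](data, encoding_info)


def decode_sequence(entries, decoders):
    """Decode every (encoding_info, data) pair of a speech segment sequence."""
    return [decode_segment(info, data, decoders) for info, data in entries]
```

Each decoder receives the encoding information as well, since it carries the items necessary for decoding, such as a prediction coefficient or a quantization code book.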

In step S1215, on the basis of the prosody determined
in step S1203, the CPU 100 modifies and connects the decoded
speech segments (i.e., edits the waveform). In step S1216,
the CPU 100 outputs the synthetic speech obtained in step
S1215 from the loudspeaker of an output device 103.

In the speech synthesis algorithm of the fifth
embodiment as described above, a desired speech segment can
be decoded by a decoding method corresponding to one of the
μ-law scheme, scalar quantization, and linear predictive
coding. Therefore, natural, high-quality synthetic speech
can be generated.

In the fifth embodiment, the aforementioned speech
synthesis algorithm is realized on the basis of the program
stored in the storage device 101. However, a part or the
whole of this speech synthesis algorithm can also be
constituted by hardware.

[Sixth Embodiment] 
A speech segment dictionary formation algorithm and
a speech synthesis algorithm according to the sixth
embodiment of the present invention will be described below
by using the speech processing apparatus shown in Fig. 1.

In the above fifth embodiment, an optimum encoding
method is selected from a plurality of encoding methods using
different encoding schemes for each speech segment data to
be registered in a speech segment dictionary 112. In the
sixth embodiment, however, an optimum encoding method is
chosen from a plurality of encoding methods using different
encoding schemes in accordance with the type of speech
segment data. Note that a speech segment to be registered
in the speech segment dictionary 112 is constructed of a
phoneme, semi-phoneme, diphone (e.g., CV or VC), VCV (or CVC),
or combinations thereof.

(Formation of speech segment dictionary) 

Fig. 13 is a flow chart for explaining the speech
segment dictionary formation algorithm in the sixth
embodiment of the present invention. A program for achieving
this algorithm is stored in a storage device 101. A CPU 100
reads out this program from the storage device 101 on the
basis of an instruction from a user and executes the following
procedure.

In step S1301, the CPU 100 initializes an index i, which
indicates each of N speech segment data (each speech segment
data is non-compressed) stored in a speech segment database
111 of an external storage device 102, to "0". Note that this
index i is stored in the storage device 101.





In step S1302, the CPU 100 reads out the ith speech segment
data Wi indicated by this index i. Assume that the readout
data Wi is

Wi = {x0, x1, ..., xT-1}

where T is the time length (in units of samples) of Wi.

In step S1303, the CPU 100 discriminates the type of
the speech segment data Wi read out in step S1302. More
specifically, the CPU 100 checks whether the type of the
speech segment data Wi is a voiced fricative sound, plosive,
unvoiced sound, nasal sound, or some other voiced sound.

If the type of the speech segment data Wi is a voiced
fricative sound or plosive, the flow advances to step S1316,
and the CPU 100 does not compress this speech segment data
Wi. With this arrangement, degradation of the quality of the
voiced fricative sound or plosive can be prevented. In step
S1316, the CPU 100 writes encoding information of the speech
segment data Wi in the speech segment dictionary 112. This
encoding information contains the type of the speech segment
data Wi and information indicating that the speech segment
data Wi is not encoded. In step S1317, the CPU 100 writes the
speech segment data Wi in the speech segment dictionary 112
without encoding the speech segment data Wi, and the flow
advances to step S1318.

If the type of the speech segment data is an unvoiced
sound, the flow advances to step S1306. In step S1306, the
CPU 100 encodes the speech segment data Wi by using the
encoding scheme (i.e., scalar quantization) explained in the
second or third embodiment. In step S1307, the CPU 100 writes
encoding information of the speech segment data Wi in the
speech segment dictionary 112. This encoding information
contains the type of the speech segment data Wi, information
specifying the encoding method by which the speech segment
data Wi is encoded, and information necessary to decode the
speech segment data Wi (e.g., a quantization code book). In
step S1308, the CPU 100 writes the speech segment data Wi
encoded in step S1306 into the speech segment dictionary 112,
and the flow advances to step S1318.

If the type of the speech segment data is a nasal sound,
the flow advances to step S1310. In step S1310, the CPU 100
encodes the speech segment data Wi by using the encoding
scheme (i.e., linear predictive coding) explained in the
fourth embodiment. In step S1311, the CPU 100 writes
encoding information of the speech segment data Wi in the
speech segment dictionary 112. This encoding information
contains the type of the speech segment data Wi, information
specifying the encoding method by which the speech segment
data Wi is encoded, and information necessary to decode the
speech segment data Wi (e.g., a prediction coefficient and
a quantization code book). In step S1312, the CPU 100 writes
the speech segment data Wi encoded in step S1310 into the
speech segment dictionary 112, and the flow advances to step
S1318.

If the type of the speech segment data Wi is some other
voiced sound, the flow advances to step S1313. In step S1313,
the CPU 100 encodes the speech segment data Wi by using the
encoding scheme (i.e., the 7-bit μ-law scheme or the 8-bit
μ-law scheme) explained in the first embodiment. In step
S1314, the CPU 100 writes encoding information of the speech
segment data Wi in the speech segment dictionary 112. This
encoding information contains the type of the speech segment
data Wi, information specifying the encoding method by which
the speech segment data Wi is encoded, and information
necessary to decode the speech segment data Wi. In step S1315,
the CPU 100 writes the speech segment data Wi encoded in step
S1313 into the speech segment dictionary 112, and the flow
advances to step S1318.

In step S1318, the CPU 100 checks whether the above
processing is performed for all of the N speech segment data.
If i = N - 1, the CPU 100 completes this algorithm. If not,
in step S1319 the CPU 100 adds 1 to the index i, the flow
returns to step S1302, and the CPU 100 reads out speech segment
data designated by the updated index i. The CPU 100
repeatedly executes this processing for all of the N speech
segment data.
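
The correspondence between the type of speech segment data and the encoding scheme used in Fig. 13 can be written as a simple table, as in the following sketch. The type labels, the key names of the encoding information, and the function name are hypothetical; the discrimination of the type itself (step S1303) is assumed to have been performed already.

```python
# Type of speech segment data -> encoding scheme (None: stored uncompressed).
SCHEME_BY_TYPE = {
    "voiced_fricative": None,
    "plosive": None,
    "unvoiced": "scalar_quantization",
    "nasal": "linear_predictive_coding",
    "other_voiced": "mu_law",
}


def encoding_info_for(segment_type):
    """Build the encoding information written before the segment data."""
    if segment_type not in SCHEME_BY_TYPE:
        raise ValueError("unknown segment type: %r" % (segment_type,))
    method = SCHEME_BY_TYPE[segment_type]
    return {"type": segment_type, "encoded": method is not None, "method": method}
```

Unlike the fifth embodiment, no encoding distortion has to be calculated, so each speech segment data is encoded at most once.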

In the speech segment dictionary formation algorithm
of the sixth embodiment as described above, an encoding
scheme can be selected from the μ-law scheme, scalar
quantization, and linear predictive coding in accordance
with the type of speech segment to be registered in the speech
segment dictionary 112. With this arrangement, a storage
capacity necessary for the speech segment dictionary can be
very efficiently reduced without deteriorating the quality
of speech segments to be registered in the speech segment
dictionary. Also, a larger number of types of speech
segments than in conventional speech segment dictionaries
can be registered in a speech segment dictionary having a
storage capacity equivalent to those of the conventional
dictionaries.

In the sixth embodiment, the aforementioned speech
segment dictionary formation algorithm is realized on the
basis of the program stored in the storage device 101.
However, a part or the whole of this speech segment dictionary
formation algorithm can also be constituted by hardware.

(Speech synthesis)

Fig. 14 is a flow chart for explaining the speech
synthesis algorithm in the sixth embodiment of the present
invention. A program for achieving this algorithm is stored
in the storage device 101. The CPU 100 reads out this program
on the basis of an instruction from a user and executes the
following procedure.

Steps S1401 to S1403 have the same functions and
processes as in steps S1201 to S1203 of Fig. 12, so a detailed
description thereof will be omitted.




In step S1404, the CPU 100 obtains an optimum speech 
segment sequence on the basis of a speech segment sequence 
obtained in step S1402 and prosody determined in step S1403. 
The CPU 100 selects one speech segment contained in this 
speech segment sequence and retrieves speech segment data 
and encoding information corresponding to the selected 
speech segment. If the speech segment dictionary 112 is 
stored in a storage medium such as a hard disk, the CPU 100 
sequentially seeks to storage areas of speech segment data 
and encoding information. If the speech segment dictionary 
112 is stored in a storage medium such as a RAM, the CPU 100 
sequentially moves a pointer (address register) to storage 
areas of speech segment data and encoding information. 

In step S1405, the CPU 100 reads out the encoding 
information retrieved in step S1404 from the speech segment 
dictionary 112. In step S1406, the CPU 100 reads out the 
speech segment data retrieved in step S1404 from the speech
segment dictionary 112. 

In step S1406, on the basis of the encoding information 
read out in step S1405, the CPU 100 discriminates the type 
of the speech segment data retrieved in step S1404. More 
specifically, the CPU 100 checks whether the type of the 
speech segment data is a voiced fricative sound, plosive, 
unvoiced sound, nasal sound, or some other voiced sound. 

If the type of the speech segment data is a voiced
fricative sound or plosive, the flow advances to step S1416.
In step S1416, the CPU 100 reads out the speech segment data
retrieved in step S1404, and the flow advances to step S1417.
In this case, this speech segment data is not encoded.

If the type of the speech segment data is an unvoiced
sound, the flow advances to step S1414. In step S1414, the
CPU 100 reads out the speech segment data retrieved in step
S1404, and the flow advances to step S1415. This speech
segment data is encoded by scalar quantization. In step
S1415, the CPU 100 decodes this speech segment data on the
basis of the encoding information read out in step S1405.

If the type of the speech segment data is a nasal sound,
the flow advances to step S1412. In step S1412, the CPU 100
reads out the speech segment data retrieved in step S1404,
and the flow advances to step S1413. This speech segment data
is encoded by linear predictive coding. In step S1413, the
CPU 100 decodes this speech segment data on the basis of the
encoding information read out in step S1405.

If the type of the speech segment data is some other
voiced sound, the flow advances to step S1410. In step S1410,
the CPU 100 reads out the speech segment data retrieved in
step S1404, and the flow advances to step S1411. This speech
segment data is encoded by the μ-law scheme. In step S1411,
the CPU 100 decodes this speech segment data on the basis
of the encoding information read out in step S1405.

In step S1417, the CPU 100 checks whether speech
segment data corresponding to all speech segments contained
in the speech segment sequence obtained in step S1404 are
decoded. If all speech segment data are decoded, the flow
advances to step S1418. If speech segment data not decoded
yet is present, the flow returns to step S1404 to decode the
next speech segment data.

In step S1418, on the basis of the prosody determined
in step S1403, the CPU 100 modifies and connects the decoded
speech segments (i.e., edits the waveform). In step S1419,
the CPU 100 outputs the synthetic speech obtained in step
S1418 from the loudspeaker of an output device 103.

In the speech synthesis algorithm of the sixth
embodiment as described above, a desired speech segment can
be decoded by a decoding method corresponding to one of the
μ-law scheme, scalar quantization, and linear predictive
coding. With this arrangement, natural, high-quality
synthetic speech can be generated.

In the sixth embodiment, the aforementioned speech
synthesis algorithm is realized on the basis of the program
stored in the storage device 101. However, a part or the
whole of this speech synthesis algorithm can also be
constituted by hardware.

[Other Embodiments]

In the second, fourth, and fifth embodiments described
above, scalar quantization is used as the method of
quantization. However, vector quantization can also be
applied by regarding a plurality of consecutive samples as
one vector.
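
A minimal sketch of such vector quantization is shown below. Consecutive samples are grouped into vectors of dimension dim, and each vector is replaced by the index of the nearest code book vector in terms of squared Euclidean distance. The function names, the zero padding of the last vector, and the given code book are assumptions made for illustration.

```python
def to_vectors(samples, dim):
    """Group consecutive samples into dim-sample vectors (zero-padded)."""
    padded = list(samples) + [0] * (-len(samples) % dim)
    return [tuple(padded[i:i + dim]) for i in range(0, len(padded), dim)]


def vq_encode(samples, code_book, dim):
    """Index of the nearest code book vector for each group of samples."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(code_book)), key=lambda k: dist(code_book[k], vec))
            for vec in to_vectors(samples, dim)]


def vq_decode(codes, code_book, length):
    """Concatenate code book vectors and trim the padding."""
    out = [x for c in codes for x in code_book[c]]
    return out[:length]
```

One index is stored per dim samples, so the amount of stored data per sample decreases as the dimension increases, at the cost of a larger code book.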

Also, it is possible to divide an unstable speech 
segment such as a plosive into two portions before and after 
the plosion and encode these two portions by their respective 
optimum encoding methods. This can further improve the 
encoding efficiency of an unstable speech segment. 
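
How the position of the plosion is located is not specified above. Purely as an illustrative heuristic, the following sketch takes the position of the largest jump in absolute amplitude between adjacent samples as the plosion and divides the segment there; the function names and the heuristic itself are assumptions. Each returned portion could then be encoded by its own encoding method.

```python
def find_plosion(samples):
    """Index of the sample following the largest jump in |amplitude|.

    Illustrative heuristic only; a practical detector would differ.
    """
    if len(samples) < 2:
        return len(samples)
    jumps = [abs(abs(samples[i]) - abs(samples[i - 1]))
             for i in range(1, len(samples))]
    return 1 + max(range(len(jumps)), key=jumps.__getitem__)


def split_plosive(samples):
    """Divide a segment into the portions before and after the plosion."""
    k = find_plosion(samples)
    return samples[:k], samples[k:]
```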

The fourth embodiment has been explained on the basis 
of a linear prediction model. However, some other vocal cord
filter model is also applicable. For example, an LMA (Log 
Magnitude Approximation) filter coefficient can be used in 
place of a linear prediction coefficient, and model 
parameters can be calculated by using the residual error of 
this LMA filter instead of a prediction difference. With 
this arrangement, the fourth embodiment can be applied to 
the cepstrum domain. 

Each of the above embodiments is applicable to a system 
comprising a plurality of devices (e.g., a host computer, 
interface device, reader, and printer) or to an apparatus 
(e.g., a copying machine or facsimile apparatus) comprising 
a single device.

In each of the above embodiments, on the basis of
instructions by program codes read out by the CPU 100, an
operating system (OS) or the like running on the CPU 100 can
execute a part or the whole of actual processing.

Furthermore, in each of the above embodiments, program
codes read out from the storage device 101 are written in
a memory of a function extension unit connected to the CPU
100, and a CPU or the like of this function extension unit
executes a part or the whole of actual processing on the basis
of instructions by the program codes.

In each of the embodiments as described above, an
encoding method can be selected for each speech segment data.
Therefore, a storage capacity necessary for the speech
segment dictionary can be very efficiently reduced without
deteriorating the quality of speech segments to be registered
in the speech segment dictionary. Also, natural,
high-quality synthetic speech can be generated by using the
speech segment dictionary thus formed.

The present invention is not limited to the above
embodiments and various changes and modifications can be made
within the spirit and scope of the present invention.
Therefore, to apprise the public of the scope of the present 
invention, the following claims are made. 



